Abstract

This paper describes the conversion of a Hid- 
den Markov Model into a finite state trans- 
ducer that closely approximates the behavior 
of the stochastic model. In some cases the 
transducer is equivalent to the HMM. This 
conversion is especially advantageous for part- 
of-speech tagging because the resulting trans- 
ducer can be composed with other transducers 
that encode correction rules for the most fre- 
quent tagging errors. The speed of tagging is 
also improved. The described methods have 
been implemented and successfully tested. 

1 Introduction 

This paper presents an algorithm^ which approxi- 
mates a Hidden Markov Model (HMM) by a finite- 
state transducer (FST). We describe one applica- 
tion, namely part-of-speech tagging. Other poten- 
tial applications may be found in areas where both 
HMMs and finite-state technology are applied, such 
as speech recognition, etc. The algorithm has been 
fully implemented. 

An HMM used for tagging encodes, like a trans- 
ducer, a relation between two languages. One lan- 
guage contains sequences of ambiguity classes ob- 
tained by looking up in a lexicon all words of a sen- 
tence. The other language contains sequences of tags 
obtained by statistically disambiguating the class se- 
quences. From the outside, an HMM tagger behaves 
like a sequential transducer that deterministically 
maps every class sequence to a tag sequence, e.g.: 

[DET,PRD] [ADJ,NOUN] [ADJ,NOUN] [END]

DET ADJ NOUN END



^There are other (different) algorithms for HMM 
to FST conversion: An unpublished one by Julian M. 
Kupiec and John T. Maxwell (p.c), and n-type and s- 
type approximation by Kempe (1997). 



The main advantage of transforming an HMM is 
that the resulting transducer can be handled by fi- 
nite state calculus. Among others, it can be com- 
posed with transducers that encode: 

• correction rules for the most frequent tagging 
errors which are automatically generated (Brill, 
1992; Roche and Schabes, 1995) or manually 
written (Chanod and Tapanainen, 1995), in or- 
der to significantly improve tagging accuracy^. 
These rules may include long-distance depen- 
dencies not handled by HMM taggers, and can 
conveniently be expressed by the replace oper- 
ator (Kaplan and Kay, 1994; Karttunen, 1995; 
Kempe and Karttunen, 1996). 

• further steps of text analysis, e.g. light parsing 
or extraction of noun phrases or other phrases 
(Ait-Mokhtar and Chanod, 1997). 

These compositions enable complex text analysis to 
be performed by a single transducer. 

The speed of tagging by an FST is up to six times 
higher than with the original HMM. 

The motivation for deriving the FST from an 
HMM is that the HMM can be trained and con- 
verted with little manual effort. 

An HMM transducer builds on the data (probabil- 
ity matrices) of the underlying HMM. The accuracy 
of this data has an impact on the tagging accuracy 
of both the HMM itself and the derived transducer. 
The training of the HMM can be done on either a 
tagged or untagged corpus, and is not a topic of this 
paper since it is exhaustively described in the liter- 
ature (Bahl and Mercer, 1976; Church, 1988). 

An HMM can be identically represented by a 
weighted FST in a straightforward way. We are, 
however, interested in non-weighted transducers. 

^Automatically derived rules require less work than 
manually written ones but are unlikely to yield better 
results because they would consider relatively limited 
context and simple relations only. 



2 b-Type Approximation 

This section presents a method that approximates 
a (first order) Hidden Markov Model (HMM) by a 
finite-state transducer (FST), called b-type approxi- 
mation^. Regular expression operators used in this 
section are explained in the annex. 

Looking up, in a lexicon, the word sequence of a 
sentence produces a unique sequence of ambiguity 
classes. Tagging the sentence by means of a (first 
order) HMM consists of finding the most probable 
tag sequence T given this class sequence C (eq. 1, 
fig. 1). The joint probability of the sequences C and 
T can be estimated by: 

$$p(C,T) = p(c_1 \ldots c_n,\, t_1 \ldots t_n) = \pi(t_1)\, b(c_1|t_1) \cdot \prod_{i=2}^{n} a(t_i|t_{i-1})\, b(c_i|t_i) \qquad (2)$$
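As an illustration of eq. 2, the following minimal Python sketch multiplies out the joint probability for a short sequence. The dictionary layout (`pi`, `a`, `b`), the function name and all numeric values are assumptions made for this example only; they are not taken from the implementation described in the paper.

```python
def joint_probability(classes, tags, pi, a, b):
    """Joint probability p(C, T) of a class and a tag sequence (eq. 2).

    pi[t]      initial probability of tag t
    a[(t, u)]  transition probability a(t | u), u being the previous tag
    b[(c, t)]  class probability b(c | t)
    """
    if not classes or len(classes) != len(tags):
        raise ValueError("need equally long, non-empty sequences")
    # First position: initial probability times class probability.
    p = pi[tags[0]] * b[(classes[0], tags[0])]
    # Remaining positions: one transition and one class probability each.
    for i in range(1, len(tags)):
        p *= a[(tags[i], tags[i - 1])] * b[(classes[i], tags[i])]
    return p


# Toy parameters, invented for illustration only.
pi = {"DET": 0.5, "PRON": 0.5}
a = {("NOUN", "DET"): 0.6, ("ADJ", "DET"): 0.4}
b = {("[DET,PRON]", "DET"): 0.8,
     ("[ADJ,NOUN]", "NOUN"): 0.5,
     ("[ADJ,NOUN]", "ADJ"): 0.5}

p = joint_probability(["[DET,PRON]", "[ADJ,NOUN]"], ["DET", "NOUN"], pi, a, b)
```

Tagging then amounts to choosing, among all tag sequences compatible with the class sequence, the one that maximises this value.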



2.1 Basic Idea 

The determination of a tag of a particular word can- 
not be made separately from the other tags. Tags 
can influence each other over a long distance via 
transition probabilities. 

In this approach, an ambiguity class is disambiguated with respect to a context. A context consists of a sequence of ambiguity classes limited at both ends by some selected tag^. For the left context of length β we use the term look-back, and for the right context of length α we use the term look-ahead.
Figure 1: Disambiguation of classes between two selected tags



In figure 1, the tag $t_i$ can be selected from the class $c_i$ because it is between two selected tags^, which are $t_{i-2}$ at a look-back distance of β = 2 and $t_{i+2}$ at a look-ahead distance of α = 2. Actually, the two selected tags $t_{i-2}$ and $t_{i+2}$ allow not only the disambiguation of the class $c_i$ but of all classes in between, i.e. $c_{i-1}$, $c_i$ and $c_{i+1}$.

^Name given by the author, to distinguish the algorithm from n-type and s-type approximation (Kempe, 1997).

^The algorithm is explained for a first order HMM. In the case of a second order HMM, b-type sequences must begin and end with two selected tags rather than one.

We approximate the tagging of a whole sentence 
by tagging subsequences with selected tags at both 
ends (fig. 1); and then overlapping them. The most 
probable paths in the tag space of a sentence, i.e. 
valid paths according to this approach, can be found 
as sketched in figure 2. 

Figure 2: Two valid paths through the tag space of a sentence

Figure 3: Incompatible sequences in the tag space of a sentence

A valid path consists of an ordered set of overlap- 
ping sequences in which each member overlaps with 
its neighbour except for the first or last tag. There 
can be more than one valid path in the tag space 
of a sentence (fig. 2). Sets of sequences that do not 
overlap in such a way are incompatible according to 
this model, and do not constitute valid paths (fig. 3). 

2.2 b-Type Sequences 

Given a length β of look-back and a length α of look-ahead, we generate for every class $c_0$, every look-back sequence $t_{-\beta}\; c_{-\beta+1} \ldots c_{-1}$, and every look-ahead sequence $c_1 \ldots c_{\alpha-1}\; t_{\alpha}$, a b-type sequence^:

$$t_{-\beta}\;\; c_{-\beta+1} \ldots c_{-1}\;\; c_0\;\; c_1 \ldots c_{\alpha-1}\;\; t_{\alpha} \qquad (3)$$



For example: 

CONJ [DET,PRON] [ADJ, NOUN, VERB] [NOUN, VERB] VERB (4) 

Each such original b-type sequence (eq. 3,4; fig. 4) 
is disambiguated based on a first order HMM. Here 
we use the Viterbi algorithm (Viterbi, 1967; Ra- 
biner, 1990) for efficiency. 
Figure 4: b-Type sequence

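For an original b-type sequence, the joint probability of its class sequence C with its tag sequence T (fig. 4) can be estimated by:

$$p(C,T) = p(c_{-\beta+1} \ldots c_{\alpha-1},\; t_{-\beta} \ldots t_{\alpha}) = \left[ \prod_{i=-\beta+1}^{\alpha-1} a(t_i|t_{i-1})\, b(c_i|t_i) \right] \cdot a(t_{\alpha}|t_{\alpha-1}) \qquad (5)$$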

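At every position in the look-back sequence and in the look-ahead sequence, a boundary # may occur, i.e. a sentence beginning or end. No look-back (β = 0) or no look-ahead (α = 0) is also allowed. The above probability estimation (eq. 5) can then be expressed more generally (fig. 4) as:

$$p(C,T) = p_{start} \cdot p_{middle} \cdot p_{end} \qquad (6)$$

with $p_{start}$ being

$$p_{start} = a(t_{-\beta+1}|t_{-\beta}) \quad \text{for selected tag } t_{-\beta} \qquad (7)$$

$$p_{start} = \pi(t_{-\beta+1}) \quad \text{for boundary } \# \qquad (8)$$

$$p_{start} = 1 \quad \text{for } \beta = 0 \qquad (9)$$

with $p_{middle}$ being

$$p_{middle} = b(c_{-\beta+1}|t_{-\beta+1}) \cdot \prod_{i=-\beta+2}^{\alpha-1} a(t_i|t_{i-1})\, b(c_i|t_i) \quad \text{for } \alpha+\beta > 0 \qquad (10)$$

$$p_{middle} = b(c_0|t_0) \quad \text{for } \alpha+\beta = 0 \qquad (11)$$

and with $p_{end}$ being

$$p_{end} = a(t_{\alpha}|t_{\alpha-1}) \quad \text{for selected tag } t_{\alpha} \qquad (12)$$

$$p_{end} = 1 \quad \text{for boundary } \# \text{ or } \alpha = 0 \qquad (13)$$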
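The decomposition in eq. 6-13 can be sketched in Python as follows. This is a hedged illustration, not the author's implementation: the names `sequence_probability` and `best_middle_tag`, the dictionary layout and the toy probabilities are all assumptions, and an exhaustive search over tag combinations is used in place of the Viterbi algorithm for brevity. The class list is simply taken as given, so it starts at $c_0$ when there is no look-back.

```python
from itertools import product

BOUNDARY = "#"


def sequence_probability(left, classes, tags, right, pi, a, b):
    """p(C, T) = p_start * p_middle * p_end for one b-type sequence.

    left, right: a selected tag, BOUNDARY, or None (no look-back/look-ahead).
    classes:     the ambiguity classes between the limits (tuples of tags).
    tags:        one candidate tag per class.
    """
    if left is None:
        p = 1.0                        # eq. 9
    elif left == BOUNDARY:
        p = pi[tags[0]]                # eq. 8
    else:
        p = a[(tags[0], left)]         # eq. 7
    p *= b[(classes[0], tags[0])]      # first factor of p_middle
    for i in range(1, len(tags)):
        p *= a[(tags[i], tags[i - 1])] * b[(classes[i], tags[i])]
    if right is not None and right != BOUNDARY:
        p *= a[(right, tags[-1])]      # eq. 12, otherwise p_end = 1 (eq. 13)
    return p


def best_middle_tag(left, classes, right, middle, pi, a, b):
    """Most likely tag of classes[middle]; exhaustive search, not Viterbi."""
    best = max(
        product(*classes),
        key=lambda tags: sequence_probability(left, classes, tags, right, pi, a, b),
    )
    return best[middle]


# Toy parameters, invented for illustration only.
c = ("ADJ", "NOUN")
pi = {"ADJ": 0.3, "NOUN": 0.7}
a = {("ADJ", "DET"): 0.4, ("NOUN", "DET"): 0.5,
     ("VERB", "ADJ"): 0.05, ("VERB", "NOUN"): 0.6,
     ("NOUN", "ADJ"): 0.8, ("NOUN", "NOUN"): 0.2}
b = {(c, "ADJ"): 0.5, (c, "NOUN"): 0.5}

before_verb = best_middle_tag("DET", [c], "VERB", 0, pi, a, b)
before_noun = best_middle_tag("DET", [c], "NOUN", 0, pi, a, b)
```

With these invented numbers the same class receives different tags depending on the selected tag that closes the look-ahead, which is the effect the b-type sequences are meant to capture.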

When the most likely tag sequence is found for an original b-type sequence, the class $c_0$ in the middle position (eq. 3) is associated with its most likely tag $t_0$. We formulate constraints for the other tags $t_{-\beta}$ and $t_{\alpha}$ and classes $c_{-\beta+1} \ldots c_{-1}$ and $c_1 \ldots c_{\alpha-1}$ of the original b-type sequence. Thus we obtain a tagged b-type sequence^:

$$t_{-\beta}^{B\beta}\;\; c_{-\beta+1}^{B(\beta-1)} \ldots c_{-2}^{B2}\; c_{-1}^{B1}\;\; c_0{:}t_0\;\; c_1^{A1}\; c_2^{A2} \ldots c_{\alpha-1}^{A(\alpha-1)}\;\; t_{\alpha}^{A\alpha} \qquad (14)$$

stating that $t_0$ is the most probable tag in the class $c_0$ if it is preceded by $t^{B\beta}\; c^{B(\beta-1)} \ldots c^{B2}\; c^{B1}$ and followed by $c^{A1}\; c^{A2} \ldots c^{A(\alpha-1)}\; t^{A\alpha}$.

In expression 14 the subscripts $-\beta,\; -\beta+1, \ldots, 0, \ldots, \alpha-1,\; \alpha$ denote the position of the tag or class in the b-type sequence, and the superscripts $B\beta,\; B(\beta-1), \ldots, B1$ and $A1, \ldots, A(\alpha-1),\; A\alpha$ express constraints for preceding and following tags and classes which are part of other b-type sequences. In the example^:

CONJ-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:ADJ [NOUN,VERB]-A1 VERB-A2 (15)

ADJ is the most likely tag in the class [ADJ,NOUN,VERB] if it is preceded by the tag CONJ two positions back (B2), by the class [DET,PRON] one position back (B1), and followed by the class [NOUN,VERB] one position ahead (A1) and by the tag VERB two positions ahead (A2).

Boundaries are denoted by a particular symbol # and can occur at the edge of the look-back and look-ahead sequence:

[Expressions (16)-(20): tagged b-type sequences limited by the boundary symbol # at the edge of the look-back, of the look-ahead, or of both]

For example: 

#-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:ADJ [NOUN,VERB]-A1 VERB-A2 (21)



^Regular expression operators used in this article are 
explained in the annex. 



CONJ-B2 [DET,PRON]-B1 [ADJ,NOUN,VERB]:NOUN #-A1 (22)



Note that look-back of length β and look-ahead of length α also include all sequences shorter than β or α, respectively, that are limited by #.

For a given length β of look-back and a length α of look-ahead, we generate every possible original b-type sequence (eq. 3), disambiguate it statistically (eq. 5-13), and encode the tagged b-type sequence $B_i$ (eq. 14) as an FST. All sequences are then unioned

$${}^{\cup}B = \bigcup_i B_i \qquad (23)$$

and we generate a preliminary tagger model $B'$

$$B' = [\,{}^{\cup}B\,]^{*} \qquad (24)$$

where all sequences $B_i$ can occur in any order and number (including zero times) because no constraints have yet been applied.
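To give a feeling for how many original b-type sequences have to be generated, the following Python sketch enumerates and counts them. It rests on one reading of the text above, stated here as an assumption: a look-back of length β is either a selected tag followed by β-1 classes, or a boundary # followed by 0 to β-1 classes, and the look-ahead mirrors this. The function names are hypothetical.

```python
from itertools import product


def contexts(tags, classes, length):
    """All look-back contexts of up to `length` positions, far end first.

    A full-length context starts with a selected tag; contexts limited by a
    sentence boundary start with '#' and may be shorter.
    """
    if length == 0:
        return [()]
    full = [(t, *cs) for t in tags
            for cs in product(classes, repeat=length - 1)]
    bounded = [("#", *cs) for k in range(length)
               for cs in product(classes, repeat=k)]
    return full + bounded


def original_b_type_sequences(tags, classes, beta, alpha):
    """Yield (look_back, c0, look_ahead) for every original b-type sequence."""
    look_backs = contexts(tags, classes, beta)
    # A look-ahead is a mirrored look-back: classes first, limit last.
    look_aheads = [tuple(reversed(ctx)) for ctx in contexts(tags, classes, alpha)]
    for look_back, c0, look_ahead in product(look_backs, classes, look_aheads):
        yield look_back, c0, look_ahead


def count_sequences(n_tags, n_classes, beta, alpha):
    """Closed-form count matching the enumeration above."""
    def n_contexts(length):
        if length == 0:
            return 1
        return (n_tags * n_classes ** (length - 1)
                + sum(n_classes ** k for k in range(length)))
    return n_contexts(beta) * n_classes * n_contexts(alpha)


toy = list(original_b_type_sequences(["A", "B"], ["c1", "c2", "c3"], 2, 1))
```

Under this counting, 36 tags and 181 classes with β = 2 and α = 1 give 44 856 506 original sequences, which suggests why longer contexts quickly become impractical for large tag sets.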

2.3 Concatenation Constraints 

To ensure a correct concatenation of sequences $B_i$, we have to make sure that every $B_i$ is preceded and followed by other $B_i$ according to what is encoded in the look-back and look-ahead constraints. E.g. the sequence in example (21) must be preceded by a sentence beginning, #, and the class [DET,PRON] and followed by the class [NOUN,VERB] and the tag VERB.

We create constraints for preceding and following tags, classes and sentence boundaries. For the look-back, a particular tag $t_i$ or class $c_j$ is required for a particular distance of δ ≤ -1, by^:

R^δ(t_i) = ~[ ~[ ?* t_i [\∪t]* [∪t [\∪t]*]^(-δ-1) ] t_i^B(-δ) ?* ]   (25)

R^δ(c_j) = ~[ ~[ ?* c_j [\∪c]* [∪c [\∪c]*]^(-δ-1) ] c_j^B(-δ) ?* ]   (26)

for δ ≤ -1

with ∪t and ∪c being the union of all tags and all classes respectively.

A sentence beginning, #, is required for a particular look-back distance of δ ≤ -1, on the side of the tags, by:

R^δ(#) = ~[ ~[ [\∪t]* [∪t [\∪t]*]^(-δ-1) ] #^B(-δ) ?* ]   (27)

for δ ≤ -1



In the case of look-ahead we require for a particular distance of δ ≥ 1, a particular tag $t_i$ or class $c_j$ or a sentence end, #, on the side of the tags, in a similar way by:

R^δ(t_i) = ~[ ?* t_i^Aδ ~[ [\∪t]* [∪t [\∪t]*]^(δ-1) t_i ?* ] ]   (28)

R^δ(c_j) = ~[ ?* c_j^Aδ ~[ [\∪c]* [∪c [\∪c]*]^(δ-1) c_j ?* ] ]   (29)

R^δ(#) = ~[ ?* #^Aδ ~[ [\∪t]* [∪t [\∪t]*]^(δ-1) ] ]   (30)

for δ ≥ 1



All tags $t_i$ are required for the look-back only at the distance of δ = -β and for the look-ahead only at the distance of δ = α. All classes $c_j$ are required for distances of δ ∈ [-β+1, -1] and δ ∈ [1, α-1]. Sentence boundaries, #, are required for distances of δ ∈ [-β, -1] and δ ∈ [1, α].

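We create the intersection $R_t$ of all tag constraints, the intersection $R_c$ of all class constraints, and the intersection $R_\#$ of all sentence boundary constraints:

$$R_t = \bigcap_{\substack{i \\ \delta \in \{-\beta,\, \alpha\}}} R^{\delta}(t_i) \qquad (31)$$

$$R_c = \bigcap_{\substack{j \in [1,m] \\ \delta \in [-\beta+1,\,-1] \cup [1,\,\alpha-1]}} R^{\delta}(c_j) \qquad (32)$$

$$R_\# = \bigcap_{\delta \in [-\beta,\,-1] \cup [1,\,\alpha]} R^{\delta}(\#) \qquad (33)$$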



All constraints are enforced by composition with the preliminary tagger model $B'$ (eq. 24). The class constraint $R_c$ is composed on the upper side of $B'$, which is the side of the classes (eq. 14), and both the tag constraint $R_t$ and the boundary constraint^ $R_\#$ are composed on the lower side of $B'$, which is the side of the tags^:

B'' = R_c .o. B' .o. R_t .o. R_#   (34)

Having ensured correct concatenation, we delete all symbols r that have served to constrain tags, classes or boundaries, using $D_r$:



r = 



(35) 



^The boundary constraint could alternatively be 
computed for and composed on the side of the classes. 
The transducer which encodes i?# would then, however, 
be bigger because the number of classes is bigger than 
the number of tags. 



D_r = r -> []   (36)

By composing^ B'' (eq. 34) on the lower side with $D_r$ and on the upper side with the inverted relation $D_r$.i, we obtain the final tagger model B:

B = D_r.i .o. B'' .o. D_r   (37)

We call the model a b-type model, the correspond- 
ing FST a b-type transducer, and the whole algo- 
rithm leading from the HMM to the transducer, a 
b-type approximation of an HMM. 

2.4 Properties of b-Type Transducers 

There are two groups of b-type transducers with different properties: FSTs without look-back and/or without look-ahead (β·α = 0) and FSTs with both look-back and look-ahead (β·α > 0). Both accept any sequence of ambiguity classes.

b-Type FSTs with β·α = 0 are always sequential. They map a class sequence that corresponds to the word sequence of a sentence, always to exactly one tag sequence. Their tagging accuracy and similarity with the underlying HMM increases with growing β+α. A b-type FST with β = 0 and α = 0 is equivalent to an n0-type FST, and with β = 1 and α = 0 it is equivalent to an n1-type FST (Kempe, 1997).

b-Type FSTs with β·α > 0 are in general not sequential. For a class sequence they deliver a set of different tag sequences, which means that the tagging results are ambiguous. This set is never empty, and the most probable tag sequence according to the underlying HMM is always in this set. The longer the look-back distance β and the look-ahead distance α are, the larger the FST and the smaller the set of resulting tag sequences. For sufficiently large β+α, this set may always contain only one tag sequence. In this case the FST is equivalent to the underlying HMM. For reasons of size however, this FST may not be computable for particular HMMs (sec. 4).

3 An Implemented Finite-State Tagger 

The implemented tagger requires three transducers 
which represent a lexicon, a guesser and an approx- 
imation of an HMM mentioned above. 

Both the lexicon and guesser are sequential, i.e. 
deterministic on the input side. They both unam- 
biguously map a surface form of any word that they 
accept to the corresponding ambiguity class (fig. 5, 
col. 1 and 2): First of all, the word is looked for in the 

^For efficiency reasons, we actually do not delete the constraint symbols r by composition. We rather traverse the network, and overwrite every symbol r with the empty string symbol ε. In the following determinization of the network, all ε are eliminated.



lexicon. If this fails, it is looked for in the guesser. If 
this equally fails, it gets the label [UNKNOWN] which 
denotes the ambiguity class of unknown words. Tag 
probabilities in this class are approximated by tags 
of words that appear only once in the training cor- 
pus. 
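The lookup cascade just described can be sketched in a few lines of Python. The lexicon is modelled here as a plain dictionary and the guesser as a function returning `None` on failure; both, together with the suffix rule and the sample entries, are assumptions for illustration and do not reflect the transducers actually used.

```python
UNKNOWN = "[UNKNOWN]"


def ambiguity_class(word, lexicon, guesser):
    """Return the ambiguity class of a word: lexicon, then guesser, then UNKNOWN."""
    cls = lexicon.get(word)
    if cls is None:
        cls = guesser(word)
    if cls is None:
        cls = UNKNOWN
    return cls


def toy_guesser(word):
    """Hypothetical suffix-based guesser; returns None when it has no guess."""
    if word.endswith("ly"):
        return "[ADV]"
    return None


# Invented sample entries.
toy_lexicon = {"the": "[DET]", "share": "[NOUN,VERB]"}

sentence = ["the", "share", "quickly", "xyzzy"]
class_sequence = [ambiguity_class(w, toy_lexicon, toy_guesser) for w in sentence]
```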

As soon as an input token gets labeled with the 
tag class of sentence end symbols (fig. 5: [SENT]), 
the tagger stops reading words from the input. At 
this point, the tagger has read and stored the words 
of a whole sentence (fig. 5, col. 1) and generated the 
corresponding sequence of classes (fig. 5, col. 2). 

The class sequence is now mapped to a tag se- 
quence (fig. 5, col. 3) using the HMM transducer. A 
b-type FST is not sequential in general (sec. 2.4), 
so to obtain a unique tagging result, the finite-state 
tagger can be run in a special mode, where only the first result found is retained, and the tagger does not look for other results^. Since paths through an FST have no particular order, the result retained is random.

The tagger outputs the stored word and tag se- 
quence of the sentence, and continues in the same 
way with the remaining sentences of the corpus. 
Figure 5: Tagging a sentence



The tagger can be run in a statistical mode where 
the number of tag sequences found per sentence is 
counted. These numbers give an overview of the 
degree of non-sequentiality of the concerned b-type 
transducer (sec. 2.4). 
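A minimal sketch of how such counts could be summarised is given below. It groups the number of results per sentence into the buckets used in table 3 (1, 2, 3, 4, 5-8, 9-16) and reports percentages; the function names, the extra ">16" bucket and the sample counts are assumptions made for this example.

```python
from collections import Counter


def bucket(n_results):
    """Map a number of tagging results to a bucket label."""
    if n_results <= 4:
        return str(n_results)
    if n_results <= 8:
        return "5-8"
    if n_results <= 16:
        return "9-16"
    return ">16"  # assumed overflow bucket, not part of table 3


def result_histogram(results_per_sentence):
    """Percentage of sentences per bucket, rounded to two decimals."""
    total = len(results_per_sentence)
    tally = Counter(bucket(n) for n in results_per_sentence)
    return {label: round(100.0 * count / total, 2)
            for label, count in tally.items()}


# Invented counts for four sentences.
histogram = result_histogram([1, 1, 1, 2])
```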



This mode of retaining the first result only is not 
necessary with n-type and s-type transducers which are 
both sequential (Kempe, 1997). 
Language: English
Corpora: 19 944 words for HMM training, 19 934 words for test
Tag set: 36 tags, 181 classes
* Multiple, i.e. ambiguous tagging results: Only first result retained

Types of FST (Finite-State Transducers):
  n0, n1          n-type transducers (Kempe, 1997)
  s+n1 (1M,F8)    s-type transducer (Kempe, 1997), with subsequences of frequency > 8, from a training corpus of 1 000 000 words, completed with n1-type
  b (β=2,α=1)     b-type transducer (sec. 2), with look-back of 2 and look-ahead of 1

Computers:
  ultra2          1 CPU, 512 MBytes physical RAM, 1.4 GBytes virtual RAM
  sparc20         1 CPU, 192 MBytes physical RAM, 827 MBytes virtual RAM

Table 1: Accuracy, speed, size and creation time of some HMM transducers



4 Experiments and Results 

This section compares different FSTs with each 
other and with the original HMM. 

As expected, the FSTs perform tagging faster 
than the HMM. 

Since all FSTs are approximations of HMMs, they show lower tagging accuracy than the HMMs. In the case of FSTs with β > 1 and α = 1, this difference in accuracy is negligible. Improvement in accuracy can be expected since these FSTs can be composed with FSTs encoding correction rules for frequent errors (sec. 1).

For all tests below an English corpus, lexicon and 
guesser were used, which were originally annotated 
with 74 different tags. We automatically recoded the 
tags in order to reduce their number, i.e. in some 
cases more than one of the original tags were recoded 
into one and the same new tag. We applied different 
recodings, thus obtaining English corpora, lexicons 
and guessers with reduced tag sets of 45, 36, 27, 18 
and 9 tags respectively. 

FSTs with β = 2 and α = 1 and with β = 1 and α = 2 were equivalent, in all cases where they could be computed.



Table 1 compares different FSTs for a tag set of 36 tags.

The b-type FST with no look-back and no look-ahead, which is equivalent to an n0-type FST (Kempe, 1997), shows the lowest tagging accuracy (b-FST (β = 0, α = 0): 87.21 %). It is also the smallest transducer (1 state and 181 arcs, as many as tag classes) and can be created faster than the other FSTs (6 sec).

The highest accuracy is obtained with a b-type FST with β = 2 and α = 1 (b-FST (β = 2, α = 1): 97.34 %) and with an s-type FST (Kempe, 1997) trained on 1 000 000 words (s+n1-FST (1M, F1): 97.33 %). In these two cases the difference in accuracy with respect to the underlying HMM (97.35 %) is negligible. In this particular test, the s-type FST comes out ahead because it is considerably smaller than the b-type FST.

The size of a b-type FST increases with the size of the tag set and with the length of look-back plus look-ahead, β+α. Accuracy improves with growing β+α.

b-Type FSTs may produce ambiguous tagging re- 
sults (sec. 2.4). In such instances only the first result 
was retained (sec. 3). 
Language: English
Corpora: 19 944 words for HMM training, 19 934 words for test
Types of FST (Finite-State Transducers): cf. table 1
* Multiple, i.e. ambiguous tagging results: Only first result retained

Cell entries: e.g. 97.06 / 99.98 stands for a tagging accuracy of 97.06 %, and an agreement of FST with HMM tagging results of 99.98 %.
Transducer could not be computed, for reasons of size.

Table 2: Tagging accuracy and agreement of the FST tagging results with those of the underlying HMM, for tag sets of different sizes



Table 2 shows the tagging accuracy and the agreement of the tagging results with the results of the underlying HMM for different FSTs and tag sets of different sizes.

To get results that are almost equivalent to those of an HMM, a b-type FST needs at least a look-back of β = 2 and a look-ahead of α = 1 or vice versa. For reasons of size, this kind of FST could only be computed for tag sets with 36 tags or less. A b-type FST with β = 3 and α = 1 could only be computed for the tag set with 9 tags. This FST gave exactly the same tagging results as the underlying HMM.

Table 3 illustrates which of the b-type FSTs are 
sequential, i.e. always produce exactly one tagging 
result, and which of the FSTs are non-sequential. 

For all tag sets, the FSTs with no look-back (β = 0) and/or no look-ahead (α = 0) behaved sequentially. Here 100 % of the tagged sentences had only one result. Most of the other FSTs (β·α > 0) behaved non-sequentially. For example, in the case of 27 tags with β = 1 and α = 1, 90.08 % of the tagged sentences had one result, 9.46 % had two results, 0.23 % had three results, etc.

Non-sequentiality decreases with growing look-back and look-ahead, β+α, and should completely disappear with sufficiently large β+α. Such b-type FSTs can, however, only be computed for small tag sets. We could compute this kind of FST only for the case of 9 tags with β = 3 and α = 1.

The set of alternative tag sequences for a sentence, produced by a b-type FST with β·α > 0, always contains the tag sequence that corresponds with the result of the underlying HMM.



                      Sentences with n tagging results (in %)
Transducer            n=1      n=2      n=3      n=4      5-8      9-16

74 tags, 297 classes (original tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       75.14    20.18    0.34     3.42     0.80     0.11
b-FST (β=2,α=1)       FST was not computable

45 tags, 214 classes (reduced tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       75.71    19.73    0.68     3.19     0.68
b-FST (β=2,α=1)       FST was not computable

36 tags, 181 classes (reduced tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       78.56    17.90    0.34     2.85     0.34
b-FST (β=2,α=1)       99.77    0.23

27 tags, 119 classes (reduced tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       90.08    9.46     0.23     0.11     0.11
b-FST (β=2,α=1)       99.77    0.23

18 tags, 97 classes (reduced tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       93.04    6.84              0.11
b-FST (β=2,α=1)       99.89    0.11

9 tags, 67 classes (reduced tag set)
b-FST (β·α=0)         100
b-FST (β=1,α=1)       86.66    12.43             0.91
b-FST (β=2,α=1)       99.77    0.23
b-FST (β=3,α=1)       100

Language: English
Test corpus: 19 934 words, 877 sentences
Types of FST (Finite-State Transducers): cf. table 1

Table 3: Percentage of sentences with a particular number of tagging results



5 Conclusion and Future Research 

The algorithm presented in this paper describes the 
construction of a finite-state transducer (FST) that 
approximates the behaviour of a Hidden Markov 
Model (HMM) in part-of-speech tagging. 

The algorithm, called b-type approximation, uses
look-back and look-ahead of freely selectable length. 

The size of the FSTs grows with both the size of 
the tag set and the length of the look-back plus look- 
ahead. Therefore, to keep the FST at a computable 
size, an increase in the length of the look-back or 
look-ahead, requires a reduction of the number of 
tags. In the case of small tag sets (e.g. 36 tags), the 
look-back and look-ahead can be sufficiently large 
to obtain an FST that is almost equivalent to the 
original HMM. 

In some tests s-type FSTs (Kempe, 1997) and b-type FSTs reached equal tagging accuracy. In these cases s-type FSTs are smaller because they encode the most frequent ambiguity class sequences of a training corpus very accurately and all other sequences less accurately. b-Type FSTs encode all sequences with the same accuracy. Therefore, a b-type FST can reach equivalence with the original HMM, but an s-type FST cannot.

The algorithms of both conversion and tagging are 
fully implemented. 

The main advantage of transforming an HMM is 
that the resulting FST can be handled by finite state 
calculus^ and thus be directly composed with other 
FSTs. 

The tagging speed of the FSTs is up to six times 
higher than the speed of the original HMM. 

Future research will include the composition of HMM transducers with, among others:

• FSTs that encode correction rules for the most frequent tagging errors in order to significantly improve tagging accuracy (above the accuracy of the underlying HMM). These rules can either be extracted automatically from a corpus (Brill, 1992) or written manually (Chanod and Tapanainen, 1995).

• FSTs for light parsing, phrase extraction and other text analysis (Aït-Mokhtar and Chanod, 1997).

An HMM transducer can be composed with one 
or more of these FSTs in order to perform complex 
text analysis by a single FST. 

ANNEX: Regular Expression Operators

Below, a and b designate symbols, A and B designate languages, and R and Q designate relations between two languages. More details on the following operators and pointers to finite-state literature can be found in

http://www.xrce.xerox.com/research/mltt/fst

~A        Complement (negation). Set of all strings except those from the language A.
\a        Term complement. Any symbol other than a.
A*        Kleene star. Language A zero or more times concatenated with itself.
A^n       A n times. Language A n times concatenated with itself.
a -> b    Replace. Relation where every a on the upper side gets mapped to a b on the lower side.
a:b       Symbol pair with a on the upper and b on the lower side.
R.i       Inverse relation where both sides are exchanged with respect to R.
A B       Concatenation of all strings of A with all strings of B.
R .o. Q   Composition of the relations R and Q.
0 or []   Empty string (epsilon).
?         Any symbol in the known alphabet and its extensions.

^A large library of finite-state functions is available at Xerox.

Acknowledgements 

I am grateful to all colleagues who helped me, par- 
ticularly to Lauri Karttunen (XRCE Grenoble) for 
extensive discussion, and to Julian Kupiec (Xerox 
PARC) for sending me information on his own re- 
lated work. Many thanks to Irene Maxwell for cor- 
recting various versions of the paper. 

References 

Aït-Mokhtar, Salah and Chanod, Jean-Pierre (1997). Incremental Finite-State Parsing. In the Proceedings of the 5th Conference of Applied Natural Language Processing (ANLP). ACL, pp. 72-79. Washington, DC, USA.

Bahl, Lalit R. and Mercer, Robert L. (1976). Part 
of Speech Assignment by a Statistical Decision Al- 
gorithm. In IEEE international Symposium on 
Information Theory, pp. 88-89. Ronneby. 

Brill, Eric (1992). A Simple Rule-Based Part-of- 
Speech Tagger. In the Proceedings of the 3rd con- 
ference on Applied Natural Language Processing, 
pp. 152-155. Trento, Italy 

Chanod, Jean-Pierre and Tapanainen, Pasi (1995). 
Tagging French - Comparing a Statistical and a 
Constraint Based Method. In the Proceedings of 
the 7th conference of the EACL, pp. 149-156. 
ACL. Dublin, Ireland, cmp-lg/9503003 

Church, Kenneth W. (1988). A Stochastic Parts 
Program and Noun Phrase Parser for Unrestricted 
Text. In Proceedings of the 2nd Conference on 
Applied Natural Language Processing. ACL, pp. 
136-143. 

Kaplan, Ronald M. and Kay, Martin (1994). Regu- 
lar Models of Phonological Rule Systems. In Com- 
putational Linguistics. 20:3, pp. 331-378. 

Karttunen, Lauri (1995). The Replace Operator. In 
the Proceedings of the 33rd Annual Meeting of the 
Association for Computational Linguistics. Cam- 
bridge, MA, USA. cmp-lg/9504032 



Kempe, Andre and Karttunen, Lauri (1996). Par- 
allel Replacement in Finite State Calculus. In 

the Proceedings of the 16th International Confer- 
ence on Computational Linguistics, pp. 622-627. 
Copenhagen, Denmark, cmp-lg/9607007 

Kempe, Andre (1997). Finite State Transducers Ap- 
proximating Hidden Markov Models. In the Pro- 
ceedings of the 35th Annual Meeting of the Associ- 
ation for Computational Linguistics, pp. 460-467. 
Madrid, Spain, cmp-lg/9707006 

Rabiner, Lawrence R. (1990). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Readings in Speech Recognition (eds. A. Waibel, K.F. Lee). Morgan Kaufmann Publishers, Inc. San Mateo, CA., USA.

Roche, Emmanuel and Schabes, Yves (1995). Deter- 
ministic Part-of-Speech Tagging with Finite-State 
Transducers. In Computational Linguistics. Vol. 
21, No. 2, pp. 227-253. 

Viterbi, A.J. (1967). Error Bounds for Convolutional Codes and an Asymptotical Optimal Decoding Algorithm. In Proceedings of IEEE, vol. 61, pp. 268-278.



